Format Reference

CDX File Format

A CDX file consists of individual lines of text, each of which summarizes a single web document.
The first line of the file is a legend that describes how to interpret the data. Each following line contains the data needed to reference the corresponding page within the host. The first character of the file is the field delimiter used throughout the rest of the file. It is followed by the literal "CDX" and then the individual field markers defined below.

The following is a sample from a CDX file:

 CDX A b e a m s c k r V v D d g M n
20010424210551 text/html 200 58670fbe7432c5bed6f3dcd7ea32b221 a725a64ad6bb7112c55ed26c9e4cef63 - 17130110 59129865 1927657 6501523 DE_crawl6.20010424210458 - 5750
20010424210312 text/html 200 d520038e97d7538855715ddcba613d41 30025030eeb72e9345cc2ddf8b5ff218 - 47392928 145482381 4426829 15345336 DE_crawl3.20010424210104 - 6356
20010424212403 text/html 200 52242643710547ff4ce2605ed03ed9e2 b06d037c06e7ffd7afc6db270aca7645 - 21301376 62305547 1855363 6627262 DE_crawl6.20010424212307 - 6317
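The legend convention described above (delimiter first, then the literal "CDX", then the field markers) can be sketched in Python. This is a minimal illustration, not a reference parser: the function names `parse_legend` and `parse_record` are hypothetical, the field markers are treated as opaque keys, and the comma-delimited legend in the usage example is made up to show that the delimiter is read from the file rather than assumed to be a space.

```python
def parse_legend(line):
    """Split a CDX legend line into (delimiter, list of field markers)."""
    line = line.rstrip("\r\n")
    if len(line) < 4:
        raise ValueError("legend line is too short")
    # The first character of the file is the field delimiter.
    delimiter = line[0]
    parts = line[1:].split(delimiter)
    # The delimiter must be followed by the literal "CDX".
    if parts[0] != "CDX":
        raise ValueError("legend does not start with the literal 'CDX'")
    return delimiter, parts[1:]


def parse_record(line, delimiter, markers):
    """Map one data line onto the field markers taken from the legend."""
    values = line.rstrip("\r\n").split(delimiter)
    # Assumption: a well-formed record has exactly one value per marker.
    if len(values) != len(markers):
        raise ValueError(
            "expected %d fields, found %d" % (len(markers), len(values))
        )
    return dict(zip(markers, values))


# Legend taken from the sample: a space delimiter and 16 field markers.
delimiter, markers = parse_legend(" CDX A b e a m s c k r V v D d g M n")

# Hypothetical comma-delimited file with three fields, for illustration only.
demo_delim, demo_markers = parse_legend(",CDX,b,m,s")
demo_record = parse_record("20010424210551,text/html,200", demo_delim, demo_markers)
```

Reading the delimiter from the legend instead of hard-coding a space keeps the sketch valid for any file that follows the convention, and the field-count check surfaces truncated or malformed records early.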

CDX Data Specifications